Usually when loading files (big files) into Hive tables static
partitions are preferred. That saves your time in loading data compared
to dynamic partition. You "statically" add a partition in table and move
the file into the partition of the table. Since the files are big they
are usually generated in HDFS. You can get the partition column value
form the filename, day of date etc without reading the whole big file.In static partitioning, we need to specify the partition column value in each and every LOAD statement.
Hive Static Partitioning
- Insert input data files individually into a partition table is Static Partition.
- Usually when loading files (big files) into Hive tables static partitions are preferred.
- Static Partition saves your time in loading data compared to dynamic partition.
- You “statically” add a partition in the table and move the file into the partition of the table.
- We can alter the partition in the static partition.
- You can get the partition column value from the filename, day of date etc without reading the whole big file.
- If you want to use the Static partition in the hive you should set property set hive.mapred.mode = strict This property set by default in hive-site.xml
- Static partition is in Strict Mode.
- You should use where clause to use limit in the static partition.
- You can perform Static partition on Hive Manage table or external table.
Partitioning data is often used for distributing load horizontally, this
has performance benefit,and helps in organizing data in a logical
fashion. Example: if we are dealing with a large employee table and
often run queries with WHERE clauses that restrict the results to a
particular country or department. For a faster query response Hive table
can be PARTITIONED BY (country string,dept string).Partitioning tables
changes how Hive structures the data storage and Hive will now create
subdirectories reflecting the partitioning structure like
.../employees/country=ABC/DEPT=XYZ.
If query limits for employee from country=ABC, it will only scan the contents of one directory country=ABC. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering. Partitioning feature is very useful in Hive, however, a design that creates too many partitions may optimize some queries, but be detrimental for other important queries. Other drawback is having too many partitions is the large number of Hadoop files and directories that are created unnecessarily and overhead to NameNode since it must keep all metadata for the file system in memory.
Create Partitioned Table
.../employees/country=ABC/DEPT=XYZ.
If query limits for employee from country=ABC, it will only scan the contents of one directory country=ABC. This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering. Partitioning feature is very useful in Hive, however, a design that creates too many partitions may optimize some queries, but be detrimental for other important queries. Other drawback is having too many partitions is the large number of Hadoop files and directories that are created unnecessarily and overhead to NameNode since it must keep all metadata for the file system in memory.
Create Partitioned Table
create table user
(
firstname VARCHAR(64),
lastname VARCHAR(64),
address STRING,
city VARCHAR(64),
post STRING,
phone1 VARCHAR(64),
phone2 STRING,
email STRING,
web STRING
)
PARTITIONED BY (country VARCHAR(64), state VARCHAR(64))
row format delimited fields terminated by '\t'
LINES TERMINATED BY '\n' STORED AS TEXTFILE
(
firstname VARCHAR(64),
lastname VARCHAR(64),
address STRING,
city VARCHAR(64),
post STRING,
phone1 VARCHAR(64),
phone2 STRING,
email STRING,
web STRING
)
PARTITIONED BY (country VARCHAR(64), state VARCHAR(64))
row format delimited fields terminated by '\t'
LINES TERMINATED BY '\n' STORED AS TEXTFILE
Data File Data
To she see the table Partition
Show partitions Table Name
Show partitions Table Name
Load Data into Table
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/data/Hive_Data/sampl_data1.txt' INTO TABLE USER_SP PARTITION (country = 'US', state = 'CA');
INSERT OVERWRITE TABLE USER_SP PARTITION (country = 'US', state = 'NY') SELECT Firstname,Lastname,Address,City,Postal Code,Phone1,Phone2,Email,Web FROM USER WHERE country = 'US' AND state = 'NY';
Internally, the data for each partition will be stored as separate files under separate sub-directories. We can see the physical storage by going to the HDFS location of the table.
hadoop fs -ls /apps/hive/warehouse/db_name.db/USER/country=US/state=NY/
LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/data/Hive_Data/sampl_data1.txt' INTO TABLE USER_SP PARTITION (country = 'US', state = 'CA');
INSERT OVERWRITE TABLE USER_SP PARTITION (country = 'US', state = 'NY') SELECT Firstname,Lastname,Address,City,Postal Code,Phone1,Phone2,Email,Web FROM USER WHERE country = 'US' AND state = 'NY';
Internally, the data for each partition will be stored as separate files under separate sub-directories. We can see the physical storage by going to the HDFS location of the table.
hadoop fs -ls /apps/hive/warehouse/db_name.db/USER/country=US/state=NY/
How to get data from specific partition in Hive table?
show partitions TABLENAME
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
pt=2012.07.28.10/is_complete=1
pt=2012.07.28.11/is_complete=1
select * from TABLENAME where pt='2012.07.28.10/is_complete=1' limit 1;
pt=2012.07.28.08/is_complete=1
pt=2012.07.28.09/is_complete=1
pt=2012.07.28.10/is_complete=1
pt=2012.07.28.11/is_complete=1
select * from TABLENAME where pt='2012.07.28.10/is_complete=1' limit 1;
Note: In Static partition some risky queries are not allowed to run. They include:
- Cartesian Product.
- No partition being picked up for a query.
- Comparing bigints and strings.
- Comparing bigints and doubles.
- Orderby without limit.
No comments:
Post a Comment